Abstract

In recent work we have presented a formal 
framework for linguistic annotation based on 
labeled acyclic digraphs. These 'annotation graphs' 
offer a simple yet powerful method for representing 
complex annotation structures incorporating 
hierarchy and overlap. Here, we motivate and 
illustrate our approach using discourse-level 
annotations of text and speech data drawn from 
the CALLHOME, COCONUT, MUC-7, DAMSL 
and TRAINS annotation schemes. With the help 
of domain specialists, we have constructed a hybrid 
multi-level annotation for a fragment of the Boston 
University Radio Speech Corpus which includes 
the following levels: segment, word, breath, ToBI, 
Tilt, Treebank, coreference and named entity. We 
show how annotation graphs can represent hybrid 
multi-level structures which derive from a diverse 
set of file formats. We also show how the approach 
facilitates substantive comparison of multiple 
annotations of a single signal based on different 
theoretical models. The discussion shows how 
annotation graphs open the door to wide-ranging 
integration of tools, formats and corpora. 

1 Annotation Graphs 

When we examine the kinds of speech transcription 
and annotation found in many existing 'communi- 
ties of practice', we see commonality of abstract 
form along with diversity of concrete format. Our
survey of annotation practice (Bird and Liberman,
1999) attests to this commonality amidst diversity.
(See [www.ldc.upenn.edu/annotation] for pointers to
online material.) We observed that all annotations
of recorded linguistic signals require one unavoidable 
basic action: to associate a label, or an ordered 
sequence of labels, with a stretch of time in the 
recording(s). Such annotations also typically distin- 
guish labels of different types, such as spoken words 
vs. non-speech noises. Different types of annota- 
tion often span different-sized stretches of recorded 
time, without necessarily forming a strict hierarchy: 
thus a conversation contains (perhaps overlapping)
conversational turns, turns contain (perhaps inter-
rupted) words, and words contain (perhaps shared)
phonetic segments. Some types of annotation are
systematically incommensurable with others: thus
disfluency structures (Taylor, 1995) and focus struc-
tures (Jackendoff, 1972) often cut across conversa-
tional turns and syntactic constituents.

A minimal formalization of this basic set of prac- 
tices is a directed graph with fielded records on the 
arcs and optional time references on the nodes. We 
have argued that this minimal formalization in fact 
has sufficient expressive capacity to encode, in a 
reasonably intuitive way, all of the kinds of linguis- 
tic annotations in use today. We have also argued 
that this minimal formalization has good properties 
with respect to creation, maintenance and searching 
of annotations. We believe that these advantages 
are especially strong in the case of discourse anno- 
tations, because of the prevalence of cross-cutting 
structures and the need to compare multiple anno- 
tations representing different purposes and perspec- 
tives. 

Translation into annotation graphs does not mag- 
ically create compatibility among systems whose 
semantics are different. For instance, there are many 
different approaches to transcribing filled pauses in 
English - each will translate easily into an annota- 
tion graph framework, but their semantic incompati- 
bility is not thereby erased. However, it does enable 
us to focus on the substantive differences without 
having to be concerned with diverse formats, and 
without being forced to recode annotations in an 
agreed, common format. Therefore, we focus on the 
structure of annotations, independently of domain- 
specific concerns about permissible tags, attributes, 
and values. 

As reference corpora are published for a wider
range of spoken language genres, annotation
work is increasingly reusing the same primary
data. For instance, the Switchboard corpus
[www.ldc.upenn.edu/Catalog/LDC93S7.html] has
been marked up for disfluency (Taylor, 1995).
See [www.cis.upenn.edu/~treebank/switchboard-sample.html]
for an example, which also includes a
separate part-of-speech annotation and a Treebank-
style annotation. Hirschman and Chinchor (1997)
give an example of MUC-7 coreference annotation
applied to an existing TRAINS dialog annotation
marking speaker turns and overlap. We shall
encounter a number of such cases here.

The Formalism 

As we said above, we take an annotation label to 
be a fielded record. A minimal but sufficient set of 
fields would be: 

type this represents a level of an annotation, such 
as the segment, word and discourse levels; 

label this is a contentful property, such as a par- 
ticular word, a speaker's name, or a discourse 
function; 

class this is an optional field which permits the
arcs of an annotation graph to be co-indexed
as members of an equivalence class.[1]

One might add further fields for holding comments, 
annotator id, update history, and so on. 

Let $T$ be a set of types, $L$ be a set of labels, and
$C$ be a set of classes. Let
$R = \{\langle t, l, c \rangle \mid t \in T, l \in L, c \in C\}$,
the set of records over $T, L, C$. Let $N$ be
a set of nodes. Annotation graphs (AGs) are now
defined as follows:

Definition 1 An annotation graph $G$ over $R, N$
is a set of triples having the form $\langle n_1, r, n_2 \rangle$, $r \in R$,
$n_1, n_2 \in N$, which satisfies the following conditions:

1. $\langle N, \{\langle n_1, n_2 \rangle \mid \langle n_1, r, n_2 \rangle \in A\} \rangle$ is a labelled
   acyclic digraph.

2. $\tau : N \rightharpoonup \Re$ is an order-preserving map assigning
   times to (some of) the nodes.

For detailed discussion of these structures, see
(Bird and Liberman, 1999). Here we present a frag-
ment (taken from the TRAINS example in §3.1) to illustrate the
definition. For convenience the components of the 
fielded records which decorate the arcs are separated 
using the slash symbol. The example contains two 
word arcs, and a discourse tag encoding 'influence 
on speaker'. No class fields are used. Not all nodes 
have a time reference. 



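[Figure: graph fragment with nodes 1 (time 52.46), 2 (no time) and 3 (time 53.14); arc W/oh/ runs from node 1 to node 2, arc W/okay/ from node 2 to node 3, and arc D/IOS:Commit/ from node 1 to node 3.]

The minimal annotation graph for this structure is as follows:

$T = \{\mathrm{W}, \mathrm{D}\}$
$L = \{\mathrm{oh}, \mathrm{okay}, \mathrm{IOS{:}Commit}\}$
$C = \emptyset$
$N = \{1, 2, 3\}$
$\tau = \{\langle 1, 52.46 \rangle, \langle 3, 53.14 \rangle\}$
$A = \{\langle 1, \mathrm{W/oh/}, 2 \rangle, \langle 2, \mathrm{W/okay/}, 3 \rangle, \langle 1, \mathrm{D/IOS{:}Commit/}, 3 \rangle\}$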
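The two conditions of Definition 1 can be checked mechanically. The following is a minimal sketch in Python, assuming arcs are stored as (node, record, node) triples and times as a partial dictionary; the function names are hypothetical and are not part of any published AG toolkit.

```python
from collections import defaultdict

# Arcs are (source node, (type, label, class), target node); class is None when unused.
ARCS = {
    (1, ("W", "oh", None), 2),
    (2, ("W", "okay", None), 3),
    (1, ("D", "IOS:Commit", None), 3),
}
# Partial map from nodes to times; node 2 has no time reference.
TIMES = {1: 52.46, 3: 53.14}


def is_acyclic(arcs):
    """Return True if the digraph induced by the arcs has no directed cycle."""
    succ = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for src, _, dst in arcs:
        nodes.update((src, dst))
        if dst not in succ[src]:
            succ[src].add(dst)
            indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:  # Kahn's algorithm: peel off nodes with no incoming arcs
        node = ready.pop()
        visited += 1
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return visited == len(nodes)


def times_respect_order(arcs, times):
    """Return True if no timed node is reachable from a later-timed node."""
    succ = defaultdict(set)
    for src, _, dst in arcs:
        succ[src].add(dst)
    for start, t_start in times.items():
        stack, seen = list(succ[start]), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in times and times[node] < t_start:
                return False
            stack.extend(succ[node])
    return True


def is_annotation_graph(arcs, times):
    """Check both conditions of the definition."""
    return is_acyclic(arcs) and times_respect_order(arcs, times)
```

Calling `is_annotation_graph(ARCS, TIMES)` on the fragment succeeds; adding an arc from node 3 back to node 1, or swapping the two time references, makes one of the two checks fail.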



XML is a natural 'surface representation' for 
annotation graphs and could provide the primary 
exchange format. A particularly simple XML 
encoding of the above structure is shown below; 
one might choose to use a richer XML encoding in 
practice. 



<annotation>
  <arc>
    <begin id="1" time="52.46"/>
    <label type="W" name="oh"/>
    <end id="2"/>
  </arc>
  <arc>
    <begin id="2"/>
    <label type="W" name="okay"/>
    <end id="3" time="53.14"/>
  </arc>
  <arc>
    <begin id="1" time="52.46"/>
    <label type="D" name="IOS:Commit"/>
    <end id="3" time="53.14"/>
  </arc>
</annotation>
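As a hedged illustration of how such a surface representation might be read back into graph form, the sketch below parses a well-formed variant of the encoding (quoted attribute values and self-closing empty elements, which is an assumption of this sketch) using Python's standard `xml.etree.ElementTree`. The function name `read_annotation` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A well-formed variant of the simple encoding; element and attribute
# names follow the example in the text.
XML = """
<annotation>
  <arc>
    <begin id="1" time="52.46"/>
    <label type="W" name="oh"/>
    <end id="2"/>
  </arc>
  <arc>
    <begin id="2"/>
    <label type="W" name="okay"/>
    <end id="3" time="53.14"/>
  </arc>
  <arc>
    <begin id="1" time="52.46"/>
    <label type="D" name="IOS:Commit"/>
    <end id="3" time="53.14"/>
  </arc>
</annotation>
"""


def read_annotation(xml_text):
    """Return (arcs, times): arcs as (node, (type, name), node) triples,
    times as a partial map from node id to time."""
    arcs, times = [], {}
    root = ET.fromstring(xml_text.strip())
    for arc in root.iter("arc"):
        begin, label, end = arc.find("begin"), arc.find("label"), arc.find("end")
        for node in (begin, end):
            if "time" in node.attrib:  # time references are optional
                times[int(node.get("id"))] = float(node.get("time"))
        record = (label.get("type"), label.get("name"))
        arcs.append((int(begin.get("id")), record, int(end.get("id"))))
    return arcs, times
```

Because node identity is carried by the `id` attribute, the repeated `time` values on nodes 1 and 3 collapse into a single entry per node.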



1 We have avoided using explicit pointers since we prefer 
not to associate formal identifiers to the arcs. Equivalence 
classes will be exemplified later. 



2 AGs and Discourse Markup 



2.1 LDC Telephone Speech Transcripts 

The LDC-published CALLHOME corpora include 
digital audio, transcripts and lexicons for telephone 
conversations in several languages, and are 
designed to support research on speech recognition 
[www.ldc.upenn.edu/Catalog/LDC96S46.html]. The
transcripts exhibit abundant overlap between 
speaker turns. What follows is a typical fragment 
of an annotation. Each stretch of speech consists of 
a begin time, an end time, a speaker designation, 
and the transcription for the cited stretch of time. 
We have augmented the annotation with + and * 
to indicate partial and total overlap (respectively) 
with the previous speaker turn. 



Figure 1: Graph Structure for LDC Telephone Speech Example

Figure 2: Visualization for LDC Telephone Speech Example



962.68 [...] A: he was changing projects every couple
  of weeks and he said he couldn't keep on top of it.
  He couldn't learn the whole new area
* 968.71 969.00 B: %mm.
970.35 971.94 A: that fast each time.
* 971.23 971.42 B: %mm.
972.46 979.47 A: %um, and he says he went in and had some
  tests, and he was diagnosed as having attention deficit
  disorder. Which
980.18 989.56 A: you know, given how he's how far he's
  gotten, you know, he got his degree at &Tufts and all,
  I found that surprising that for the first time as an
  adult they're diagnosing this. %um
+ 989.42 991.86 B: %mm. I wonder about it. But anyway.
+ 991.75 994.65 A: yeah, but that's what he said. And %um
* 994.19 994.46 B: yeah.
995.21 996.59 A: He %um
+ 996.51 997.61 B: Whatever's helpful.
+ 997.40 1002.55 A: Right. So he found this new job as a
  financial consultant and seems to be happy with that.
1003.14 1003.45 B: Good.



Long turns (e.g. the period from 972.46 to 989.56
seconds) were broken up into shorter stretches for
the convenience of the annotators and to provide
additional time references. A section of this anno-
tation which includes an example of total overlap is
represented in annotation graph form in Figure 1,
with the accompanying visualization shown in Fig-
ure 2. (We have no commitment to this particular
visualization; the graph structures can be visualized
in many ways and the perspicuity of a visualization
format will be somewhat domain-specific.)

The turns are attributed to speakers using the
speaker/ type. All of the words, punctuation and
disfluencies are given the W/ type, though we could
easily opt for a more refined version in which these
are assigned different types. The class field is not
used here. Observe that each speaker turn is a dis-
joint piece of graph structure, and that hierarchical
organisation uses the 'chart construction' (Gazdar
and Mellish, 1989, 179ff). Thus, we make a logi-
cal distinction between the situation where the end-
points of two pieces of annotation necessarily coin-
cide (by sharing the same node) and the situation
where endpoints happen to coincide (by having dis-
tinct nodes which contain the same time reference).
The former possibility is required for hierarchical
structure, and the latter possibility is required for
overlapping speaker turns where words spoken by
different speakers may happen to share the same
boundary.

2.2 Dialogue Annotation in COCONUT 

The COCONUT corpus is a set of dialogues in which 
the two conversants collaborate on a task of deciding 



what furniture to buy for a house (Di Eugenio et al.,
1998). The coding scheme augments the DAMSL
scheme (Allen and Core, 1997) by having some new
top-level tags and by further specifying some exist-
ing tags. An example is given in Figure 3.

The example shows five utterance pieces, identi-
fied (a-e), four produced by speaker S1 and one pro-
duced by speaker S2. The discourse annotations can
be glossed as follows: Accept - the speaker is agreeing 
to a possible action or a claim; Commit - the speaker 
potentially commits to intend to perform a future 
specific action, and the commitment is not contin- 
gent upon the assent of the addressee; Offer - the 
speaker potentially commits to intend to perform a 
future specific action, and the commitment is contin- 
gent upon the assent of the addressee; Open-Option 
- the speaker provides an option for the addressee's 
future action; Action-Directive - the utterance is 
designed to cause the addressee to undertake a spe- 
cific action. 

In utterance (e) of Figure 3, speaker S1 simul-
taneously accepts the meta-action in (d) of not
having matching colors, and the regular action of
using S1's yellow rug. The latter acceptance is not
explicitly represented in the original notation, so we
shall only consider the former.

Accept, Commit              S1:  (a) Let's take the blue rug for 250,
Open-Option                      (b) my rug wouldn't match
                                 (c) which is yellow for 150.
Action-Directive            S2:  (d) we don't have to match...
Accept(d), Offer, Commit    S1:  (e) well then let's use mine for 150

Figure 3: Dialogue with COCONUT Coding Scheme

[Figure 4 (image not reproduced): speaker arcs above the utterance arcs Utt/.../a through Utt/.../e, with discourse arcs Commit and Accept over (a), Open-Option over (b)-(c), Action-Directive over (d), and Accept/d, Offer and Commit over (e).]

Figure 4: Visualization of Annotation Graph for COCONUT Example

In representing this dialogue structure using anno- 
tation graphs, we will be concerned to achieve the 
following: (i) to treat multiple annotations of the 
same utterance fragment as an unordered set, rather 
than a list, to simplify indexing and query; (ii) to 
explicitly link speaker S1 to utterances (a-c); (iii)
to formalize the relationship between Accept(d) and
utterance (d); and (iv) to formalize the rest of the
annotation structure which is implicit in the textual 
representation. 

We adopt the types Sp (speaker), Utt (utterance) 
and D (discourse). A more refined type system 
could include other levels of representation, it could 
distinguish forward versus backward communicative 
function, and so on. For the names we employ: 
speaker identifiers S1, S2; discourse tags Offer,
Commit, Accept, Open-Option, Action-Directive; and 
orthographic strings representing the utterances. 
For the classes (the third, optional field) we employ 
the utterance identifiers a, b, c, d, e. 

An annotation graph representation of the
COCONUT example can now be given as in
Figure 4. The arcs are structured into three layers,
one for each type, where the types are written on 
the left. If the optional class field is specified, this 
information follows the name field, separated by a 
slash. The Accept/d arc refers to the S2 utterance 
simply by virtue of the fact that both share the 
same class field. 

Observe that the Commit and Accept tags for (a) 
are unordered, unlike the original annotation, and 
that speaker S1 is associated with all utterances (a-
c), rather than being explicitly linked to (a) and
implicitly linked to (b) and (c) as in Figure 3.



To make the referent of the Accept tag clear, we 
make use of the class field. Recall that the third 
component of the fielded records, the class field, per- 
mits arcs to refer to each other. Both the referring 
and the referenced arcs are assigned to equivalence 
class d. 
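The co-indexing mechanism can be sketched in a few lines of Python. The record layout (type, name, class) follows the text; the sample records and the helper names `by_class` and `referents` are illustrative assumptions, not part of the COCONUT distribution.

```python
from collections import defaultdict

# Hypothetical arc records (type, name, class); None marks an unused class field.
RECORDS = [
    ("Utt", "we don't have to match...", "d"),
    ("D", "Action-Directive", None),
    ("Utt", "well then let's use mine for 150", "e"),
    ("D", "Accept", "d"),
    ("D", "Offer", None),
    ("D", "Commit", None),
]


def by_class(records):
    """Group records into equivalence classes keyed by their class field."""
    groups = defaultdict(list)
    for record in records:
        if record[2] is not None:
            groups[record[2]].append(record)
    return dict(groups)


def referents(record, records):
    """Return the arcs of a different type that share the record's class."""
    if record[2] is None:
        return []
    return [r for r in records if r[2] == record[2] and r[0] != record[0]]
```

Here `referents(("D", "Accept", "d"), RECORDS)` returns the utterance arc for (d), without any explicit pointer from one arc to the other.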



2.3 Coreference Annotation in MUC-7 

The MUC-7 Message Understanding Conference 
specified tasks for information extraction, named 
entity and coreference. Coreferring expressions 
are to be linked using SGML markup with
ID and REF tags (Hirschman and Chinchor,
1997). Figure 5 is a sample of text from
the Boston University Radio Speech Corpus
[www.ldc.upenn.edu/Catalog/LDC96S36.html],
marked up with coreference tags. (We are grateful
to Lynette Hirschman for providing us with this
annotation.)

Noun phrases participating in coreference are 
wrapped with <coref>. . .</coref> tags, which can 
bear the attributes ID, REF, TYPE and MIN. Each such 
phrase is given a unique identifier, which may be 
referenced by a REF attribute somewhere else. Our 
example contains the following references: 3 → 2,
4 → 2, 6 → 5, 7 → 5, 8 → 5, 12 → 11, 15 → 13.
The TYPE attribute encodes the relationship between 
the anaphor and the antecedent. Currently, only 
the identity relation is marked, and so coreferences 
form an equivalence class. Accordingly, our example 
contains the following equivalence classes: {2,3,4}, 
{5,6,7,8}, {11,12}, {13,15}. 
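The step from REF links to equivalence classes can be sketched with a small union-find routine. This is a minimal illustration in Python over the references listed above; the function name is hypothetical, and keying each class by its smallest identifier is an assumption of the sketch.

```python
# REF links from the example: anaphor ID -> antecedent ID.
REFS = {3: 2, 4: 2, 6: 5, 7: 5, 8: 5, 12: 11, 15: 13}


def coreference_classes(refs):
    """Merge identity links into equivalence classes keyed by their smallest ID."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for anaphor, antecedent in refs.items():
        root_a, root_b = find(anaphor), find(antecedent)
        if root_a != root_b:
            # Keep the smaller identifier as the representative of the class.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    classes = {}
    for ident in parent:
        classes.setdefault(find(ident), set()).add(ident)
    return classes
```

Because only the identity relation is marked, the links are symmetric and transitive in effect, so chains of REF attributes end up in one class regardless of the order in which they are read.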

In our AG representation we choose the first num- 
ber from each of these sets as the identifier for the 
equivalence class. MUC-7 also contains a specifica-
tion for named entity annotation. Figure 7 gives an
example, to be discussed in §3.2. This uses empty



<COREF ID="2" MIN="woman">
This woman
</COREF>
receives three hundred dollars a
month under
<COREF ID="5">
General Relief
</COREF>
, plus
<COREF ID="16"
       MIN="four hundred dollars">
four hundred dollars a month in
<COREF ID="17"
       MIN="benefits" REF="16">
A.F.D.C. benefits
</COREF>
</COREF>
for
<COREF ID="9" MIN="son">
<COREF ID="3" REF="2">
her
</COREF>
son
</COREF>
, who is
<COREF ID="10" MIN="citizen" REF="9">
a U.S. citizen
</COREF>.
<COREF ID="4" REF="2">
She
</COREF>
's among
<COREF ID="18" MIN="aliens">
an estimated five hundred illegal
aliens on
<COREF ID="6" REF="5">
General Relief
</COREF>
out of
<COREF ID="11" MIN="population">
<COREF ID="13" MIN="state">
the state
</COREF>
's total illegal immigrant
population of
<COREF ID="12" REF="11">
one hundred thousand
</COREF>
</COREF>
</COREF>
<COREF ID="7" REF="5">
General Relief
</COREF>
is for needy families and unemployable
adults who don't qualify for other public
assistance. Welfare Department spokeswoman
Michael Reganburg says
<COREF ID="15" MIN="state" REF="13">
the state
</COREF>
will save about one million dollars a year if
<COREF ID="20" MIN="aliens" REF="18">
illegal aliens
</COREF>
are denied
<COREF ID="8" REF="5">
General Relief
</COREF>



Figure 5: Coreference Annotation for BU Example 




Figure 6: Annotation Graph for Coreference Example 



tags to get around the problem of cross-cutting hier-
archies. This problem does not arise in the annota-
tion graph formalism; see (Bird and Liberman, 1999,
§2.7).

3 Hybrid Annotations 



There are many cases where a given corpus is anno- 
tated at several levels, from discourse to phonetics. 
While a uniform structure is sometimes imposed,
as with Partitur (Schiel et al., 1998), established
practice and existing tools may give rise to corpora 
transcribed using different formats for different lev- 
els. Two examples of hybrid annotation will be dis- 
cussed here: a TRAINS+DAMSL annotation, and 
an eight-level annotation of the Boston University 
Radio Speech Corpus. 



3.1 DAMSL annotation of TRAINS 



The TRAINS corpus (Heeman and Allen, 1993) is a
collection of about 100 dialogues containing a total
of 5,900 speaker turns
[www.ldc.upenn.edu/Catalog/LDC95S25.html]. Part of a transcript is shown
below, where s and u designate the two speakers, 
<sil> denotes silent periods, and + denotes 
boundaries of speaker overlaps. 



utt1   s:  hello <sil> can I help you
utt2   u:  yes <sil> um <sil> I have a problem here
utt3       I need to transport one tanker of orange juice
           to Avon <sil> and a boxcar of bananas to
           Corning <sil> by three p.m.
utt4       and I think it's midnight now
utt5   s:  uh right it's midnight
utt6   u:  okay so we need to <sil>
           um get a tanker of OJ to Avon is the first
           thing we need to do
utt7       + so +
utt8   s:  + okay +
utt9       <click> so we have to make orange juice first
utt10  u:  mm-hm <sil> okay so we're gonna pick up <sil>
           an engine two <sil> from Elmira
utt11      go to Corning <sil> pick up the tanker
utt12  s:  mm-hm
utt13  u:  go back to Elmira <sil> to get <sil> pick up
           the orange juice
utt14  s:  alright <sil> um well <sil> we also need to
           make the orange juice <sil> so we need to get
           + oranges <sil> to Elmira +
utt15  u:  + oh we need to pick up + oranges oh + okay +
utt16  s:  + yeah +
utt17  u:  alright so <sil> engine number two is going to
           pick up a boxcar



Accompanying this transcription are a number of 
xwaves label files containing time-aligned word-level 
and segment-level transcriptions. Below, the start of 
file speaker0.words is shown on the left, and the start
of file speaker0.phones is shown on the right. The
first number gives the file offset (in seconds), and the 
middle number gives the label color. The final part 



This woman receives
<b_numex TYPE="MONEY">
three hundred dollars
<e_numex>
a month under General Relief, plus
<b_numex TYPE="MONEY">
four hundred dollars
<e_numex>
a month in A.F.D.C. benefits for her son, who is a
<b_enamex TYPE="LOCATION">
U.S.
<e_enamex>
citizen. She's among an estimated five hundred illegal
aliens on General Relief out of the state's total illegal
immigrant population of one hundred thousand. General
Relief is for needy families and unemployable adults
who don't qualify for other public assistance.
<b_enamex TYPE="ORGANIZATION">
Welfare Department
<e_enamex>
spokeswoman
<b_enamex TYPE="PERSON">
Michael Reganburg
<e_enamex>
says the state will save about
<b_numex TYPE="MONEY">
one million dollars
<e_numex>
a year if illegal aliens are denied General Relief.



Figure 7: Named Entity Annotation for BU Example 



is a label for the interval which ends at the speci- 
fied time. Silence is marked explicitly (again using 
<sil>) so we can infer that the first word 'hello' occu- 
pies the interval [0.110000, 0.488555]. Evidently the 
segment-level annotation was done independently of 
the word-level annotation, and so the times do not 
line up exactly. 
[Listing not recoverable from the source: the start of speaker0.words (left) and speaker0.phones (right), one entry per line giving time offset in seconds, label color and label. Surviving word labels include help, you, <sil>, uh and right; surviving segment labels include eh, ih, n, ay and hh.]

The TRAINS annotations show the presence of 
backchannel cues and overlap. An example of over- 
lap is shown below: 
Speaker s                     Speaker u

50.130000  122  <sil>
50.260000  122  so
50.330000  122  we
50.480000  122  need
50.540000  122  to
50.651716  122  get
                              51.094197  122  <sil>
                              51.306658  122  oh
51.360000  122  oranges
                              51.410000  122  we
51.470000  122  <sil>
51.540000  122  to
                              51.560000  122  need
                              51.620000  122  to
                              51.850000  122  pick
51.975728  122  Elmira
                              52.020000  122  up
                              52.470000  122  oranges
                              52.666781  122  oh
52.807837   76  <sil>
                              52.940000  122  okay
53.047996   76  yeah
                              53.535600  122  <sil>
                              53.785600  122  alright
                              54.303529  122  so



As seen in the telephone speech example of §2.1 and
explained more fully in (Bird and Liberman, 1999),
overlap carries no implications for the internal
structure of speaker turns or for the position of
turn-boundaries.



Now, independently of this annotation there is 
also a dialogue annotation in DAMSL, as shown in 
Figure 8. Here, a dialog is broken down into turns
and thence into utterances, where the tags contain 
discourse-level annotation. 

In representing this hybrid annotation as an AG 
we are motivated by the following concerns. First, 
we want to preserve the distinction between the 
TRAINS and DAMSL components, so that they can 
remain in their native formats (and be manipulated 
by their native tools) and be converted indepen- 
dently to AGs then combined using AG union, and 
so that they can be projected back out if necessary. 
Second, we want to identify those boundaries that 
necessarily have the same time reference (such as 
the end of utterance 17 and the end of the word 
'Elmira'), and represent them using a single graph 
node. Contributions from different speakers will 
remain disconnected in the graph structure. Finally, 
we want to use the equivalence class names to allow 
cross-references between utterances. A fragment of 
the proposed annotation graph is depicted using our 
visualization format in Figure 9. Observe that, for
brevity, some discourse tags are not represented, and 
the phonetic segment level is omitted. 

Note that the tags in Figure 8 have the form of
fielded records and so, according to the AG defini- 
tion, all the attributes of a tag could be put into 
a single label. We have chosen to maximally split 
such records into multiple arc labels, so that search 
predicates do not need to take account of inter- 
nal structure, and to limit the consequences of an 
erroneous code. A relevant analogy here is that of 
pre-composed versus compound characters in Uni- 
code. The presence of both forms of a character in 
a text raises problems for searching and collating. 
This problem is avoided through normalization, and 
this is typically done by maximally decomposing the 
characters. 
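The Unicode analogy can be made concrete with Python's standard `unicodedata` module. This is a small sketch; the sample strings and the helper name `same_text` are illustrative choices.

```python
import unicodedata

precomposed = "caf\u00e9"   # 'é' as a single pre-composed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent


def same_text(a, b):
    """Compare two strings after maximally decomposing their characters (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

A naive comparison `precomposed == decomposed` is false even though both render identically, whereas `same_text(precomposed, decomposed)` is true. Splitting tag records into separate arc labels plays the same role for annotation search: predicates match on one canonical, fully decomposed form.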

3.2 Multiple annotations of the BU corpus 

Linguistic analysis is always multivocal, in two 
senses. First, there are many types of entities and 



<Dialog Id=d92a-2.2 Annotation-date="08-14-97" Annotator="Reconciled Version"
        Speech="/d92a-2.2/dialog.fea" Status=Verified>

<Turn Id=T9 Speaker="s" Speech="-s 44.853889 -e 52.175728">

<Utt Id=utt17 Agreement=None Influence-on-listener=Action-directive Influence-on-speaker=Commit Info-level=Task Response-to=""
     Speech="-s 45.87 -e 52.175728" Statement=Assert>
[sil] um well [sil] we also need to make the orange juice [sil]
so we need to get + oranges [sil] to Elmira +

<Turn Id=T10 Speaker="u" Speech="-s 51.106658 -e 53.14">

<Utt Id=utt18 Agreement=Accept Influence-on-listener=Action-directive Influence-on-speaker=Commit Info-level=Task
     Response-to="utt17" Speech="-s 51.106658 -e 52.67" Statement=Assert Understanding=SU-Acknowledge>
+ oh we need to pick up + oranges

<Utt Id=utt19 Agreement=Accept Influence-on-speaker=Commit Info-level=Task Response-to="utt17" Speech="-s 52.466781 -e 53.14"
     Understanding=None>
oh + okay +

<Turn Id=T11 Speaker="s" Speech="-s 52.047996 -e 53.247996">

<Utt Id=utt20 Agreement=Accept Info-level=Task Response-to="utt18" Speech="-s 52.047996 -e 53.247996" Understanding=SU-Acknowledge>
+ yeah +

</Dialog>



Figure 8: DAMSL Annotation of a TRAINS Dialogue 
Figure 9: Graph Structure for TRAINS Example



relations, on many scales, from acoustic features 
spanning a hundredth of a second to narrative 
structures spanning tens of minutes. Second, there 
are many alternative representations or construals 
of a given kind of linguistic information. 

Sometimes these alternatives are simply more 
or less convenient for a certain purpose. Thus a 
researcher who thinks theoretically of phonological 
features organized into moras, syllables and feet, 
will often find it convenient to use a phonemic 
string as a representational approximation. In 
other cases, however, different sorts of transcription 
or annotation reflect different theories about the 
ontology of linguistic structure or the functional 
categories of communication. 



The AG representation offers a way to deal pro-
ductively with both kinds of multivocality. It pro-
vides a framework for relating different categories of
linguistic analysis, and at the same time for comparing
different approaches to a given type of analysis.

As an example, Figure 10 shows an AG-
based visualization of eight different sorts of
annotation of a phrase from the BU Radio
Corpus, produced by Mari Ostendorf and others
at Boston University, and published by the
LDC [www.ldc.upenn.edu/Catalog/LDC96S36.html].
The basic material is from a recording of a
local public radio news broadcast. The BU
annotations include four types of information:
orthographic transcripts, broad phonetic transcripts
(including main word stress), and two kinds
of prosodic annotation, all time-aligned to the
digital audio files. The two kinds of prosodic
annotation implement the system known as ToBI
[www.ling.ohio-state.edu/phonetics/E_ToBI/].
ToBI is an acronym for "Tones and Break
Indices", and correspondingly provides two types of
information: Tones, which are taken from a fixed
vocabulary of categories of (stress-linked) "pitch
accents" and (juncture-linked) "boundary tones";
and Break Indices, which are integers characterizing
the strength and nature of interword disjunctures.
We have added four additional annota-
tions: coreference annotation and named
entity annotation in the style of MUC-7
[www.muc.saic.com/proceedings/muc_7_toc.html]
provided by Lynette Hirschman; syntactic structures
in the style of the Penn TreeBank (Marcus et al.,
1993) provided by Ann Taylor; and an alternative
annotation for the F0 aspects of prosody, known as
Tilt (Taylor, 1998) and provided by its inventor,
Paul Taylor. Taylor has done Tilt annotations for
much of the BU corpus, and will soon be publishing
them as a point of comparison with the ToBI tonal
annotation. Tilt differs from ToBI in providing a
quantitative rather than qualitative characterization
of F0 obtrusions: where ToBI might say "this is a
L+H* pitch accent," Tilt would say "This is an F0
obtrusion that starts at time $t_0$, lasts for duration $d$
seconds, involves $a$ Hz total F0 change, and ends $l$
Hz different in F0 from where it started."

As usual, the various annotations come in a bewil- 
dering variety of file formats. These are not entirely 
trivial to put into registration, because (for instance) 
the TreeBank terminal string contains both more 
(e.g. traces) and fewer (e.g. breaths) tokens than the 
orthographic transcription does. One other slightly 
tricky point: the connection between the word string
and the "break indices" (which are ToBI's character-
izations of the nature of interword disjuncture) is
mediated only by identity in the floating-point time
values assigned to word boundaries and to break 
indices in separate files. Since these time values are 
expressed as ASCII strings, it is easy to lose the 
identity relationship without meaning to, simply by 
reading in and writing out the values to programs 
that may make different choices of internal variable 
type (e.g. float vs. double), or number of decimal 
digits to print out, etc. 
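The fragility of string identity under numeric round-trips is easy to reproduce. The sketch below uses only the Python standard library; the time value is borrowed from the TRAINS example purely for illustration, and the helper names are hypothetical.

```python
import struct
from decimal import Decimal

word_boundary = "52.466781"  # time string as it might appear in a word file
break_index = "52.466781"    # the same boundary in a separate break-index file


def through_float32(text):
    """Read a time string into a 32-bit float and write it back out."""
    value = struct.unpack("f", struct.pack("f", float(text)))[0]
    return repr(value)


def through_two_decimals(text):
    """Read a time string and write it back with two decimal digits."""
    return f"{float(text):.2f}"


def same_time(a, b):
    """Compare time strings by exact decimal value rather than by spelling."""
    return Decimal(a) == Decimal(b)
```

Either round-trip changes the spelling of the value, so a join on string identity between the two files silently fails. Comparing exact decimal values, or better, replacing the shared time with a shared graph node, avoids the problem.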

Problems of this type are normal whenever multi- 
ple annotations need to be compared. Solving them 
is not rocket science, but does take careful work. 
When annotations with separate histories involve 
mutually inconsistent corrections, silent omissions of 
problematic material, or other typical developments, 



the problems are multiplied. In noting such difficul- 
ties, we are not criticizing the authors of the annota- 
tions, but rather observing the value of being able to 
put multiple annotations into a common framework. 

Once this common framework is established, via 
translation of all eight "strands" into AG graph 
terms, we have the basis for posing queries that 
cut across the different types of annotation. For 
instance, we might look at the distribution of Tilt 
parameters as a function of ToBI accent type; or 
the distribution of Tilt and ToBI values for initial 
vs. non-initial members of coreference sets; or the 
relative size of Tilt F0-change measures for nouns
vs. verbs. 

We do not have the space in this paper to dis- 
cuss the design of an AG-based query formalism at 
length - and indeed, many details of practical AG 
query systems remain to be decided - but a short 
discussion will indicate the direction we propose to 
take. Of course the crux is simply to be able to put 
all the different annotations into the same frame of 
reference, but beyond this, there are some aspects 
of the annotation graph formalism that have nice 
properties for defining a query system. For example, 
if an annotation graph is defined as a set of "arcs" 
like those given in the XML encoding in §0, then 
every member of the power set of this arc set is 
also a well-formed annotation graph. The power set 
construction provides the basis for a useful query 
algebra, since it defines the complete set of possible 
values for queries over the AG in question, and is 
obviously closed under intersection, union and rela- 
tive complement. As another example, various time- 
based indexes are definable on an adequately time- 
anchored annotation graph, with the result that 
many sorts of precedence, inclusion and overlap rela- 
tions are easy to calculate for arbitrary subgraphs. 
See (Bird and Liberman, 1999, §5) for discussion. 
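
Both properties can be sketched in a few lines of Python. This is an illustration under simplifying assumptions and not a proposed query language: arcs are reduced to a hypothetical (start, end, type, label) record with fully specified times, and the toy data and the helper names select, precedes, includes and overlaps are invented.

```python
from collections import namedtuple

# Hypothetical arc record; node identities are reduced to their time anchors.
Arc = namedtuple("Arc", "start end type label")

graph = frozenset({
    Arc(0.0, 0.4, "word", "the"),
    Arc(0.4, 0.9, "word", "dog"),
    Arc(0.0, 0.9, "np", "NP"),
    Arc(0.9, 1.2, "breath", "br"),
})

def select(g, **fields):
    """Return the subgraph of arcs whose fields match the given values."""
    return frozenset(a for a in g
                     if all(getattr(a, k) == v for k, v in fields.items()))

def precedes(a, b):
    """True if arc a ends no later than arc b begins."""
    return a.end <= b.start

def includes(a, b):
    """True if arc b lies within the time span of arc a."""
    return a.start <= b.start and b.end <= a.end

def overlaps(a, b):
    """True if the two arcs share some stretch of time."""
    return a.start < b.end and b.start < a.end

words = select(graph, type="word")
phrases = select(graph, type="np")

# Query results are subsets of the arc set, so union, intersection and
# relative complement all stay inside the power set of the original graph.
for result in (words | phrases, words & phrases, graph - words):
    assert result <= graph

# Words temporally included in some noun phrase.
in_np = frozenset(w for w in words if any(includes(p, w) for p in phrases))
print(sorted(a.label for a in in_np))
```

Because every intermediate result is itself a set of arcs, queries of this kind compose freely.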

In this section, we have indicated some of the ways 
in which the AG framework can facilitate the anal- 
ysis of complex combinations of linguistic annotations. 
These annotation sets are typically multivocal, both 
in the sense of covering multiple types of linguistic 
information, and also in the sense of providing multi- 
ple versions of particular types of analysis. Discourse 
studies are especially multivocal in both senses, and 
so we feel that this approach will be especially help- 
ful to discourse researchers. 

4 Conclusion 

This proliferation of formats and approaches can be 
viewed as a sign of intellectual ferment. The fact 
that so many people have devoted so much energy to 
fielding new entries into this bazaar of data formats 
indicates how important the computational study of 
communicative interaction has become. However, 
for many researchers, this multiplicity of approaches 
has produced headaches and confusion, rather than 
productive scientific advances. We need a way to 
integrate these approaches without imposing some 
form of premature closure that would crush experi- 
mentation and innovation. 

Figure 10: Visualization for BU Example 

Both here and in associated work (Bird and 
Liberman, 1999), we have endeavored to show 
how all current annotation formats involve the 
basic actions of associating labels with stretches 
of recorded signal data, and attributing logical 
sequence, hierarchy and coindexing to such labels. 
We have grounded this assertion by defining 
annotation graphs and by showing how a disparate 
range of annotation formats can be mapped into 
AGs. This work provides a central piece of the 
algebraic foundation for inter-translatable formats 
and inter-operating tools. The intention is not 
to replace the formats and tools that have been 
accepted by any existing community of practice, 
but rather to make the descriptive and analytical 
practices, the formats, data and tools universally 
accessible. This means that annotation content 
for diverse domains and theoretical models can 
be created and maintained using tools that are 
the most suitable or familiar to the community in 
question. It also means that we can get started 
on integrating annotations, corpora and research 
findings right away, without having to wait until 
final agreement on all possible tags and attributes 
has been achieved. 

There are many existing approaches to discourse 
annotation, and many options for future approaches. 
Our explorations presuppose a particular set 
of goals: (i) generality, specificity, simplicity; 
(ii) searchability and browsability; and (iii) 
maintainability and durability. These are discussed 
in full in (Bird and Liberman, 1999, §6). By 
identifying a common conceptual core to all 
annotation structures, we hope to provide a 
foundation for a wide-ranging integration of tools, 
formats and corpora. One might, by analogy to 
translation systems, describe AGs as an interlingua 
which permits free exchange of annotation data 
between n systems once n interfaces have been 
written, rather than n^2 interfaces. 
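
The interlingua idea can be sketched as a small registry in Python. Everything here is hypothetical: the two toy formats, the function names, and the reduction of the AG form to a list of (start, end, label) triples. An importer and exporter pair is counted as one interface per format.

```python
# Hypothetical registry: each format supplies one importer and one exporter.
to_ag = {}
from_ag = {}

def register(name, importer, exporter):
    """Record the interface for one format."""
    to_ag[name] = importer
    from_ag[name] = exporter

def convert(data, src, dst):
    """Translate between any two registered formats through the AG form."""
    return from_ag[dst](to_ag[src](data))

# Two invented toy formats; the AG form is a list of (start, end, label).
def read_tabbed(text):
    return [(float(s), float(e), lab)
            for s, e, lab in (line.split("\t") for line in text.splitlines())]

def write_tabbed(arcs):
    return "\n".join(f"{s}\t{e}\t{lab}" for s, e, lab in arcs)

def read_bracketed(text):
    arcs = []
    for item in text.split("]"):
        item = item.strip().lstrip("[")
        if item:
            lab, s, e = item.split()
            arcs.append((float(s), float(e), lab))
    return arcs

def write_bracketed(arcs):
    return " ".join(f"[{lab} {s} {e}]" for s, e, lab in arcs)

register("tabbed", read_tabbed, write_tabbed)
register("bracketed", read_bracketed, write_bracketed)

print(convert("0.0\t0.4\tthe\n0.4\t0.9\tdog", "tabbed", "bracketed"))
```

Adding a third format requires only one more call to register, after which it can be exchanged with every format already present.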

Although we have been primarily concerned with 
the structure rather than the content of annota- 
tions, the approach opens the way to meaningful 
evaluation of content and comparison of contentful 
differences between annotations, since it is possible 
to do all manner of quasi-correlational analyses of 
parallel annotations. A tool for converting a given 
format into the AG framework only needs to be 
written once. Once this has been done, it becomes 
a straightforward task to pose complex queries over 
multiple corpora. By contrast, if one were to start 
with annotations in several distinct file formats, it 
would be a major programming chore to ask even a 
simple question. 